Do penguins with longer flippers weigh more or less than penguins with shorter flippers? You probably already have an answer, but try to make your answer precise. What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear? Does the relationship vary by the species of the penguin? How about by the island where the penguin lives? Let’s create visualizations that we can use to answer these questions.
Load the required libraries.
library(tidyverse)
library(palmerpenguins)
library(ggthemes)
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
# ggplot function defines data source and global mapping attributes
ggplot(
data = penguins,
mapping = aes(
x = flipper_length_mm,
y = body_mass_g
)
) +
# geom functions define the plot type and local mapping attributes
geom_point(
mapping = aes(
colour = species,
shape = species
)
) +
geom_smooth(method = "lm") +
# labs adds labels
labs(
title = "Flipper length and body mass",
subtitle = "Dimensions for Adelie, Chinstrap and Gentoo species",
x = "Flipper length (mm)",
y = "Body mass (g)",
shape = "Species",
colour = "Species"
) +
# colour theme from ggthemes package
scale_color_colorblind()
How many rows are in penguins? How many columns?
nrow(penguins)
## [1] 344
ncol(penguins)
## [1] 8
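As a cross-check, both counts can be read in one call. This is a minimal sketch, assuming the palmerpenguins package is installed; `penguin_dims` is a name introduced here for illustration.

```r
library(palmerpenguins)

# dim() returns c(number of rows, number of columns)
penguin_dims <- dim(penguins)
penguin_dims
```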
What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.
?penguins
Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.
ggplot(
penguins,
aes(
x = bill_length_mm,
y = bill_depth_mm,
colour = species
)
) +
geom_point()
Reviewing the scatterplot without colour added for species, there appears to be no correlation between bill length and bill depth. However, when we show the species by colour, we can see that each species appears to have a positive correlation (as bill length increases so does bill depth).
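One way to check this reading of the plot numerically is to compare the correlation across all penguins with the correlation inside each species. The sketch below assumes dplyr and palmerpenguins are installed; `overall_r` and `species_r` are illustrative names, and `use = "complete.obs"` is chosen here so that rows with missing measurements are skipped.

```r
library(dplyr)
library(palmerpenguins)

# Correlation across all penguins, ignoring species
overall_r <- cor(
  penguins$bill_length_mm,
  penguins$bill_depth_mm,
  use = "complete.obs"
)

# Correlation within each species
species_r <- penguins |>
  group_by(species) |>
  summarise(
    r = cor(bill_length_mm, bill_depth_mm, use = "complete.obs")
  )

overall_r
species_r
```

If the within-species values are clearly positive while the pooled value is not, that supports the conclusion that species is hiding the relationship in the uncoloured plot.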
What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?
ggplot(
penguins,
aes(
x = bill_depth_mm,
y = species
)
) +
geom_point()
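This scatterplot shows that each species has a different range of bill depths, but because species is categorical the points pile up in three horizontal strips and the distribution within each species is hard to read. A boxplot (geom_boxplot()) would be a better choice of geom for comparing a numerical variable across categories.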
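A boxplot is one possible alternative geom for a categorical variable against a numerical one. The sketch below is an assumption about what a better choice might look like, not the only answer; `p_box` is an illustrative name and `na.rm = TRUE` is added only to silence the missing-value warning.

```r
library(ggplot2)
library(palmerpenguins)

# One box per species summarises the spread of bill depth
p_box <- ggplot(penguins, aes(x = species, y = bill_depth_mm)) +
  geom_boxplot(na.rm = TRUE)

p_box
```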
ggplot(data = penguins) +
geom_point()
It gives an error because the function geom_point()
requires x and y aesthetics to be defined.
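The error can be captured without stopping the session, and the plot fixed by supplying the two aesthetics. This is a sketch under assumptions: the choice of flipper_length_mm and body_mass_g for the fix is arbitrary, and `p_missing`, `build_msg` and `p_fixed` are names introduced here.

```r
library(ggplot2)
library(palmerpenguins)

# No x or y mapping: the error is raised when the plot is built
p_missing <- ggplot(data = penguins) +
  geom_point()

build_msg <- tryCatch(
  {
    ggplot_build(p_missing)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
build_msg

# Supplying x and y lets geom_point() draw the points
p_fixed <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(na.rm = TRUE)
```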
What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.
ggplot(
penguins,
aes(x = flipper_length_mm,
y = bill_depth_mm)
) +
geom_point(na.rm = TRUE)
The na.rm argument controls how missing values (NA) are handled. With the default, na.rm = FALSE, rows with missing values are removed with a warning; with na.rm = TRUE they are removed silently.
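To see how many rows the argument affects, the missing values in the two plotted variables can be counted directly. A minimal sketch, assuming ggplot2 and palmerpenguins are installed; `n_missing`, `p_warn` and `p_quiet` are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

# Rows where either plotted variable is NA
n_missing <- sum(
  is.na(penguins$flipper_length_mm) | is.na(penguins$bill_depth_mm)
)
n_missing

# Default (na.rm = FALSE): printing this plot warns about removed rows
p_warn <- ggplot(penguins, aes(x = flipper_length_mm, y = bill_depth_mm)) +
  geom_point()

# na.rm = TRUE: the same rows are dropped without a warning
p_quiet <- ggplot(penguins, aes(x = flipper_length_mm, y = bill_depth_mm)) +
  geom_point(na.rm = TRUE)
```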
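Add the following caption to the plot: "Data come from the palmerpenguins package." Hint: take a look at the documentation for labs().
# ggplot function defines data source and global mapping attributes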
ggplot(
data = penguins,
mapping = aes(
x = flipper_length_mm,
y = body_mass_g
)
) +
# geom functions define the plot type and local mapping attributes
geom_point(
mapping = aes(
colour = species,
shape = species
)
) +
geom_smooth(method = "lm") +
# labs adds labels
labs(
title = "Flipper length and body mass",
subtitle = "Dimensions for Adelie, Chinstrap and Gentoo species",
caption = "Data come from the palmerpenguins package.",
x = "Flipper length (mm)",
y = "Body mass (g)",
shape = "Species",
colour = "Species"
) +
# colour theme from ggthemes package
scale_color_colorblind()
What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?
ggplot(
penguins,
aes(
x = flipper_length_mm,
y = body_mass_g
)
) +
# colour is mapped at the geom level so it applies to the points only
geom_point(
aes(colour = bill_depth_mm)
) +
geom_smooth(
method = "gam"
)
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
geom_point() +
geom_smooth(se = FALSE)
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point() +
geom_smooth()
ggplot() +
geom_point(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_smooth(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
How you visualize the distribution of a variable depends on the type of variable: categorical or numerical.
Use a bar chart.
penguins |>
ggplot(
aes(
x = species
)
) +
geom_bar()
We can also reorder the bars based on their frequencies by transforming the variable to a factor and reordering the levels of the factor.
penguins |>
ggplot(
aes(
x = fct_infreq(species)
)
) +
geom_bar()
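To see what the reordering does to the factor itself, the levels can be inspected before and after. A small sketch, assuming forcats (part of the tidyverse) and palmerpenguins are installed; `species_by_freq` is an illustrative name.

```r
library(forcats)
library(palmerpenguins)

# Original level order
levels(penguins$species)

# Levels reordered from most to least frequent
species_by_freq <- fct_infreq(penguins$species)
levels(species_by_freq)

# Counts in the new order
table(species_by_freq)
```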
A variable is numerical (or quantitative) if it can take on a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values. Numerical variables can be continuous or discrete.
One commonly used visualization for distributions of continuous variables is a histogram.
penguins |>
ggplot(
aes(x = body_mass_g)
) +
geom_histogram(
binwidth = 200
)
An alternative visualization for distributions of numerical variables is a density plot. A density plot is a smoothed-out version of a histogram and a practical alternative, particularly for continuous data that comes from an underlying smooth distribution.
Imagine a histogram made out of wooden blocks. Then, imagine that you drop a cooked spaghetti string over it. The shape the spaghetti will take draped over blocks can be thought of as the shape of the density curve. It shows fewer details than a histogram but can make it easier to quickly glean the shape of the distribution, particularly with respect to modes and skewness.
penguins |>
ggplot(
aes(x = body_mass_g)
) +
geom_density()
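To compare the two views directly, the density curve can be drawn on top of a histogram whose y-axis is rescaled to density. This is a sketch under assumptions: the fill and outline colours are arbitrary choices, `p_overlay` is an illustrative name, and `after_stat()` and `linewidth` require a reasonably recent ggplot2.

```r
library(ggplot2)
library(palmerpenguins)

p_overlay <- ggplot(penguins, aes(x = body_mass_g)) +
  # Histogram scaled to density so both layers share a y-axis
  geom_histogram(
    aes(y = after_stat(density)),
    binwidth = 200,
    fill = "grey80",
    colour = "white",
    na.rm = TRUE
  ) +
  # Smoothed curve draped over the bars
  geom_density(linewidth = 1, na.rm = TRUE)

p_overlay
```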
penguins |>
ggplot(
aes(y = species)
) +
geom_bar()
Which aesthetic, color or fill, is more useful for changing the color of bars?
ggplot(penguins, aes(x = species)) +
geom_bar(color = "red")
ggplot(penguins, aes(x = species)) +
geom_bar(fill = "red")
On a bar plot, the fill aesthetic is more useful for
changing the colour of the bars. The color aesthetic only
changes the border of the bars, whereas the fill aesthetic
changes the whole bar colour.
What does the bins argument in geom_histogram() do?
penguins |>
ggplot(
aes(x = body_mass_g)
) +
geom_histogram(
bins = 25
)
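The bins argument sets the number of bins (bars) in the histogram. If neither bins nor binwidth is given, geom_histogram() uses 30 bins and prints a message suggesting you pick a better value.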
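The effect of bins can be checked by counting the rows that the histogram layer computes, one per bar. A minimal sketch, assuming ggplot2 and palmerpenguins are installed; `p_bins` and `n_bars` are names introduced here.

```r
library(ggplot2)
library(palmerpenguins)

p_bins <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 25, na.rm = TRUE)

# layer_data() returns the computed bins, one row per bar
n_bars <- nrow(layer_data(p_bins))
n_bars
```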
Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?
diamonds |>
ggplot(
aes(x = carat)
) +
geom_histogram(
binwidth = 0.01
)
With a binwidth of 0.01, you can see that the distribution has many modes: the counts spike at particular carat values instead of varying smoothly.
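The spikes can also be looked at as a table by counting how often each exact carat value occurs. A sketch assuming dplyr and ggplot2 (which provides diamonds) are installed; `top_carats` is an illustrative name and the cut-off of ten rows is arbitrary.

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# Most frequent exact carat values, largest count first
top_carats <- diamonds |>
  count(carat, sort = TRUE) |>
  head(10)

top_carats
```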
To visualize a relationship we need to have at least two variables mapped to aesthetics of a plot. In the following sections you will learn about commonly used plots for visualizing relationships between two or more variables and the geoms used for creating them.
Use a boxplot.
penguins |>
ggplot(
aes(
x = species,
y = body_mass_g
)
) +
geom_boxplot()
Alternatively, you could use a density plot.
penguins |>
ggplot(
aes(
x = body_mass_g,
colour = species,
fill = species
)
) +
geom_density(
alpha = 0.5
)
Use stacked bar plots.
penguins |>
ggplot(
aes(
x = island,
fill = species
)
) +
geom_bar()
Alternatively, use a relative frequency plot.
penguins |>
ggplot(
aes(
x = island,
fill = species
)
) +
geom_bar(
position = "fill"
)
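The proportions that position = "fill" draws can be computed by hand, which makes the bar heights easier to verify. A minimal sketch, assuming dplyr and palmerpenguins are installed; `island_props` is an illustrative name.

```r
library(dplyr)
library(palmerpenguins)

# Share of each species within each island
island_props <- penguins |>
  count(island, species) |>
  group_by(island) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

island_props
```

Within each island the prop values add up to 1, which is why every bar in the relative frequency plot has the same height.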
Use a scatter plot.
penguins |>
ggplot(
aes(
x = flipper_length_mm,
y = body_mass_g
)
) +
geom_point()
We can incorporate more variables into the plot by mapping them to additional aesthetics (e.g. colour).
penguins |>
ggplot(
aes(
x = flipper_length_mm,
y = body_mass_g
)
) +
geom_point(
aes(
colour = species,
shape = island
)
)
However, adding too many aesthetic mappings to a plot makes it cluttered and difficult to make sense of. Another way, which is particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.
penguins |>
ggplot(
aes(
x = flipper_length_mm,
y = body_mass_g
)
) +
geom_point(
aes(
colour = species
)
) +
facet_wrap(~island)
The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
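manufacturer, model, trans, drv, fl and class are categorical. displ, year, cyl, cty and hwy are numerical. When you print mpg or run glimpse(mpg), the type abbreviation next to each column (<chr>, <dbl>, <int>) shows this information.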
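The same split can be made programmatically from the column types. This is a sketch that classifies purely by storage type, which is an assumption: variables such as year and cyl are stored as integers but could reasonably be treated as categorical. `is_num`, `numerical_vars` and `categorical_vars` are illustrative names.

```r
library(ggplot2) # provides the mpg dataset

# TRUE for columns stored as integer or double
is_num <- vapply(mpg, is.numeric, logical(1))

numerical_vars <- names(mpg)[is_num]
categorical_vars <- names(mpg)[!is_num]

numerical_vars
categorical_vars
```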
Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?
mpg |>
ggplot(
aes(
x = displ,
y = hwy,
colour = drv,
size = cty,
shape = fl
)
) +
geom_point()
You cannot map a continuous variable to the shape
aesthetic. When a numerical variable is mapped to colour it
takes on a gradient palette but when a categorical variable is mapped to
colour it takes on a palette of distinct colours.
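The shape restriction can be demonstrated by building a plot that maps a numerical variable to shape and capturing the resulting error. A minimal sketch, assuming ggplot2 is installed; using cty as the numerical variable is an arbitrary choice, and `p_shape` and `shape_msg` are names introduced here.

```r
library(ggplot2)

# cty is numerical, so mapping it to shape should fail at build time
p_shape <- ggplot(mpg, aes(x = displ, y = hwy, shape = cty)) +
  geom_point()

shape_msg <- tryCatch(
  {
    ggplot_build(p_shape)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

shape_msg
```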
In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?
mpg |>
ggplot(
aes(
x = displ,
y = hwy,
linewidth = drv
)
) +
geom_point()
Nothing happens: geom_point() draws no lines, so the linewidth mapping is ignored and the plot looks the same as it would without it.
mpg |>
ggplot(
aes(
x = displ,
y = cty,
colour = manufacturer,
shape = manufacturer
)
) +
geom_point()
Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?
penguins |>
ggplot(
aes(
x = bill_length_mm,
y = bill_depth_mm
)
) +
geom_point(
aes(
colour = species
)
)
penguins |>
ggplot(
aes(
x = bill_length_mm,
y = bill_depth_mm
)
) +
geom_point() +
facet_wrap(~species)
Colouring by species reveals clusters of points by species. Each species appears to have a positive correlation.
ggplot(
data = penguins,
mapping = aes(
x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species
)
) +
geom_point() +
labs(color = "Species", shape = "Species")
Two legends appear when only colour is given a title in labs(): the colour and shape legends then have different titles and cannot be merged. You can fix it by giving shape the same title in labs() as well, as in the code above.
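For contrast, here is a sketch of a version that would produce two legends, assuming only the colour legend is given a title. The shape legend keeps its default title, species, so ggplot2 cannot merge the two. `p_two_legends` is an illustrative name and `na.rm = TRUE` is added only to silence the missing-value warning.

```r
library(ggplot2)
library(palmerpenguins)

p_two_legends <- ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm,
    color = species, shape = species
  )
) +
  geom_point(na.rm = TRUE) +
  # Only colour is titled "Species"; shape keeps the default title "species"
  labs(color = "Species")

p_two_legends
```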
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(position = "fill")
ggplot(penguins, aes(x = species, fill = island)) +
geom_bar(position = "fill")